AMalik03_FinalHomeworkCode_03
Some of my best friends are Zombies…
Original Code Reflection
I enjoyed most of this homework, but some of it was a struggle. I really enjoyed trying to figure out how to make my own functions despite how frustrating it was. I struggled with question 6 because I misread the question and was trying to plot at 100 X 30 data points into a vector of vectors but realized I only needed the means of each random sample. I also struggled with finding a way to add to vectors. I used the internet and found the function “append” to make changes to a vector. Finally I needed to refresh my memory on how to analyze qq plots because I didn’t know how to use the graph. I reread Module 8 to grab a better understanding of quantile-quantile plots and its uses.I didn’t use the full capabilities of ggplot2 because I wanted to make sure I did the homework correctly. I plan on adding color and prettiness for the final submission.
Final Code Reflection
Ritika’s comments and review of my code really helped me improve and find errors in my homework assignment. I realized through her comments and my peer commentary on Jess’s code that I didn’t do Question 5 right. Additionally, in question 3, when a scatter plot between weight in relation to age, Ritika believes there is a weak positive correlation. I personally think there is no relation but I used the corr.test function from Module 12 to find the correlation coefficient and changed my answer because there is indeed a weak correlation. Finally, based on Jess’s code, I realized left and right skewed is not a distribution, it’s a description of the graph, so I had to change my explanations for question 4. I also focused my final code on clarifying my explanations as well as making all my plots pretty.
1. Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education).
## Using libcurl 7.79.1 with LibreSSL/3.3.6
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::parse_date() masks curl::parse_date()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
f <-curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv")
d <-read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d) #make sure it's set up right## id first_name last_name gender height weight zombies_killed
## 1 1 Sarah Little Female 62.88951 132.0872 2
## 2 2 Mark Duncan Male 67.80277 146.3753 5
## 3 3 Brandon Perez Male 72.12908 152.9370 1
## 4 4 Roger Coleman Male 66.78484 129.7418 5
## 5 5 Tammy Powell Female 64.71832 132.4265 4
## 6 6 Anthony Green Male 71.24326 152.5246 1
## years_of_education major age
## 1 1 medicine/nursing 17.64275
## 2 3 criminal justice administration 22.58951
## 3 1 education 21.91276
## 4 6 energy studies 18.19058
## 5 3 logistics 21.10399
## 6 4 energy studies 21.48355
## [1] 67.6301
## [1] 143.9075
## [1] 143.9075
## [1] 2.992
## [1] 2.996
pop.sd <- function(x){ #create a function for population SD
sqrt(
sum(
(x - mean(x))^2
)
/ length(x)
)
}
(height.sd <- pop.sd(d$height))## [1] 4.30797
## [1] 18.39186
## [1] 2.963583
## [1] 1.747551
## [1] 1.675704
2. Use {ggplot} to make boxplots of each of these variables by gender.
heightplot <- ggplot(data = d, aes(x = gender, y = height, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Height(inches)") + labs(title = "Boxplot of Height by Gender") + theme(plot.title = element_text(hjust =0.5))
heightplotweightplot <- ggplot(data = d, aes(x = gender, y = weight, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Weight(pounds)") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5))
weightplotageplot <- ggplot(data = d, aes(x = gender, y = age, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Age(years)") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5))
ageplotkillsplot <- ggplot(data = d, aes(x = gender, y = zombies_killed, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Zombies Killed") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5))
killsploteduplot <- ggplot(data = d, aes(x = gender, y = years_of_education, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Years of Education") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5))
eduplot
## 3. Use {ggplot} to make scatterplots of height and weight in relation
to age. Do these variables seem to be related? In what way?
scatplotheight <- ggplot(data = d, aes(x = age, y = height)) + geom_point(colour = "brown4") + xlab("Age(years)") + ylab("Height(in)") + labs(title = "Relationship Between Age and Height") + theme(plot.title = element_text(hjust =0.5))
scatplotheight##
## Pearson's product-moment correlation
##
## data: d$age and d$height
## t = 26.905, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6109440 0.6829231
## sample estimates:
## cor
## 0.6483801
According to the scatterplot, there is a moderate positive
correlation between the ages and the height of the survivors because the
Pearson correlation coefficient is calculates to be 0.648.
scatplotweight <- ggplot(data = d, aes(x = age, y = weight)) + geom_point(colour = "maroon") + xlab("Age(years)") + ylab("Weight(lb)") + labs(title = "Relationship Between Age and Weight") + theme(plot.title = element_text(hjust =0.5))
scatplotweight##
## Pearson's product-moment correlation
##
## data: d$age and d$weight
## t = 10.682, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2635609 0.3748694
## sample estimates:
## cor
## 0.3203203
According to the scatter plot, there is a weak positive
relationship between the age of survivors and their weight because the
Pearson correlation coefficient is calculated to be 0.320.
4. Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?
plot.hist.qq <- function(Variable){ #Made function to plot histogram and qqplot to make plotting easier
par(mfrow = c(1,2))
plot(hist(Variable, plot = FALSE), col = "maroon", main = c("Histogram of Variable")) #Plots a basic histogram of variable y
qqnorm(Variable, frame = FALSE, col = "brown") #Plots a qq graph of data points
qqline(Variable) #plots a qq line
}
plot.hist.qq(d$height)
Yes, based on the histogram and the qqplot, the height of the
survivors are drawn from a normal distribution curve.
Yes, based on the histogram and the qqplot, the weight of the
survivors are drawn from a normal distribution curve.
Yes, based on the histogram and the qqplot, the age of the survivors
are drawn from a normal distribution curve.
No, based on the graphs, the data does not stem from a normal
distributions. The graph can be described as right-skewed. I predict
that the data is pulled from a Poisson distribution.
No, the data for survivors’ years of education is not drawn from a
normal distribution. The graph can be described as right-skewed. I
predict that the data is pulled from a Poisson distribution.
5. Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal.
## id first_name last_name gender height weight zombies_killed
## 1 298 Howard Martin Male 68.85221 150.7262 6
## 2 730 Emily Mendoza Female 65.49701 133.6932 4
## 3 177 Juan Barnes Male 62.33876 133.5535 3
## 4 439 Kimberly Graham Female 66.38030 130.4227 2
## 5 264 Craig Stephens Male 70.27466 137.8759 3
## 6 304 Robert Parker Male 73.60436 161.1474 3
## 7 963 Margaret Rice Female 66.41046 143.8121 2
## 8 315 Tammy Martin Female 65.69532 123.4174 5
## 9 562 Sharon Carr Female 70.76667 169.6664 4
## 10 165 Jane Graham Female 69.06732 140.7583 5
## 11 793 Amy Taylor Female 64.40630 113.9125 3
## 12 10 Russell Chavez Male 72.81769 168.5616 2
## 13 601 Judy Stewart Female 62.69921 120.4426 4
## 14 980 Melissa Gray Female 64.19924 135.2534 3
## 15 217 Fred Butler Male 69.85855 149.5380 0
## 16 300 Helen Cook Female 71.68438 157.6903 4
## 17 671 Lisa Harper Female 65.79459 134.8015 2
## 18 708 Gregory Gray Male 70.10855 145.2425 3
## 19 898 Sean Romero Male 69.87416 169.4200 5
## 20 294 Laura Ross Female 63.18053 127.1017 1
## 21 312 Paula Hall Female 60.34227 114.5168 2
## 22 879 Jennifer Daniels Female 68.95607 152.1361 6
## 23 281 Philip Carpenter Male 72.58839 147.1452 4
## 24 141 Ann Dunn Female 65.36311 143.3415 3
## 25 3 Brandon Perez Male 72.12908 152.9370 1
## 26 394 Adam Gonzales Male 79.48964 162.3697 1
## 27 766 Carlos Hart Male 69.70424 162.2236 2
## 28 764 Terry Myers Male 65.15692 137.0728 6
## 29 659 Jack Castillo Male 71.94457 158.1921 3
## 30 288 Juan Murray Male 73.70542 163.5918 6
## years_of_education major age
## 1 4 city planning 19.26587
## 2 2 criminal justice administration 22.05847
## 3 1 mechanical engineering 14.61176
## 4 5 philosophy 23.76154
## 5 3 animal husbandry 21.62270
## 6 3 economics 23.85202
## 7 3 economics 20.89564
## 8 4 economics 20.38294
## 9 1 psychology 21.12729
## 10 2 economics 21.46047
## 11 6 philosophy 18.06166
## 12 3 military strategy 21.50798
## 13 3 economics 16.60652
## 14 3 agricultural sciences 21.61279
## 15 7 communication 18.18100
## 16 5 philosophy 26.25881
## 17 3 military strategy 18.73178
## 18 5 animal husbandry 22.62593
## 19 2 culinart services 18.99324
## 20 5 environmental science 24.37250
## 21 3 animal husbandry 17.17613
## 22 2 education 22.93477
## 23 3 business administration 22.37798
## 24 2 physical education 15.65899
## 25 1 education 21.91276
## 26 4 physical education 26.11363
## 27 2 human services 17.81768
## 28 2 epidemiology 14.99672
## 29 1 business administration 22.77554
## 30 4 medicine/nursing 20.28010
## [1] 68.42967
## [1] 144.6855
## [1] 20.60117
## [1] 3.266667
## [1] 3.133333
## [1] 4.150541
## [1] 15.87423
## [1] 2.99947
## [1] 1.611073
## [1] 1.49963
## [1] 0.7577816
## [1] 2.898225
## [1] 0.5476258
## [1] 0.2941403
## [1] 0.2737937
xbar = NULL
sd= NULL
normbounds <- function(xbar, sem){
lower.bound <- xbar - qnorm(1-0.05/2) * sem
upper.bound <- xbar + qnorm(1-0.05/2) * sem
print(c(lower.bound, upper.bound))
}
normbounds(mean.height.s, sd.height.s)## [1] 60.29476 76.56458
## [1] 113.5725 175.7984
## [1] 14.72232 26.48003
##
## Exact Poisson test
##
## data: length(x) time base: 1
## number of events = 1, time base = 1, p-value = 1
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
## 0.02531781 5.57164339
## sample estimates:
## event rate
## 1
##
## Exact Poisson test
##
## data: length(x) time base: 1
## number of events = 1, time base = 1, p-value = 1
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
## 0.02531781 5.57164339
## sample estimates:
## event rate
## 1
6. Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]? What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?
meanheights <- c(mean.height.s)
meanweights <- c(mean.weight.s)
meanages <- c(mean.age.s)
meankills <- c(mean.kills.s)
meanedus <- c(mean.edu.s)
for (i in 1:99) { #create a for loop that adds the mean vectors of the quantitative daae to a vector of means, a total of 99 loops for 100 samples.
rsample <- sample_n(d, size = 30, replace = FALSE)
meanheights <- append(meanheights, mean(rsample$height))
meanweights <- append(meanweights, mean(rsample$weight))
meanages <- append(meanages, mean(rsample$age))
meankills <- append(meankills, mean(rsample$zombies_killed))
meanedus <- append(meanedus, mean(rsample$years_of_education))
}
mean(meanheights)## [1] 67.55221
## [1] 0.7882015
## [1] 143.484
## [1] 3.166218
## [1] 20.05964
## [1] 0.5480569
## [1] 2.980667
## [1] 0.3160935
## [1] 2.977
The standard deviations are smaller from this population are
smaller than that of the one from Question 5. Also, he standard
deviation of the means are very close to the standard error
calculated
Based on the histograms and qq plots, all the quantitative data seem
to be drawn from a normal distribution.